Abstract

The paper [5] presented a bound on the generalisation error of classifiers learned through multiple kernel learning. The bound has (an improved) additive dependence on the number of kernels (with the same logarithmic dependence on this number). However, parts of the proof were incorrectly presented in that paper. This note remedies this weakness by restating the problem and giving a detailed proof of the Rademacher complexity bound from [5].



1 Introduction 

We refer to [5] for the motivation and definitions of multiple kernel learning. 
The paper [5] presented a number of results including a new bound on the 
generalisation error of classifiers learned from a multiple kernel class with 
a logarithmic dependence on the number of kernels used and with that 
logarithm entering additively into the bound, that is, independently of the
complexity of the individual kernels or the margin of the classifier on the 
training set. 

An anonymous referee made us aware of a weakness in the presentation 
of this proof that was subsequently highlighted in a recent tutorial [4]. In 
this short note we give a detailed proof of the bound presented in [5] albeit 
with one constant weakened from 5 to 11. 
2 Detailed proof

2.1 Preliminaries 

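Let $z = \{(x_i, y_i)\}_{i=1}^m$ be an $m$-sample where $x_i \in \mathcal{X} \subset \mathbb{R}^n$ and $y_i \in \mathcal{Y} = \{-1, +1\}$, with $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Let $\mathbf{x} = \{x_1, \ldots, x_m\}$ contain the input vectors.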

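Definition 1 ([1]). A kernel is a function $\kappa$ that for all $x, x' \in \mathcal{X}$ satisfies

$$\kappa(x, x') = \langle \phi(x), \phi(x') \rangle,$$

where $\phi$ is a mapping from $\mathcal{X}$ to an (inner product) Hilbert space $\mathcal{H}$

$$\phi : \mathcal{X} \to \mathcal{H}.$$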

Kernel learning algorithms [7], [8] make use of the $m \times m$ kernel matrix $K = [\kappa(x_i, x_j)]_{i,j=1}^m$ defined using the training inputs $\mathbf{x}$. When using the kernel representation it is not always possible to represent the weight vector $w$ explicitly and so we can use the function $f$ directly as the predictor:

$$f(x) = \sum_{i=1}^m \alpha_i y_i \kappa(x_i, x) = \langle w, \phi(x) \rangle,$$

where $\alpha = (\alpha_1, \ldots, \alpha_m)$ is the dual weight vector and the corresponding norm of the weight vector is

$$\|w\|^2 = \sum_{i,j=1}^m \alpha_i y_i \alpha_j y_j \kappa(x_i, x_j).$$
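The dual representation above can be checked numerically. The following is a minimal sketch, assuming a linear kernel (so that the weight vector $w$ can be written out explicitly for comparison) and a small hypothetical data set; the function names are illustrative and do not come from [5].

```python
def linear_kernel(x, z):
    """Linear kernel: the feature map is the identity, so w is explicit."""
    return sum(a * b for a, b in zip(x, z))


def dual_predict(alpha, ys, xs, x_new, kernel):
    """f(x) = sum_i alpha_i y_i kappa(x_i, x)."""
    return sum(a * y * kernel(xi, x_new) for a, y, xi in zip(alpha, ys, xs))


def dual_sq_norm(alpha, ys, xs, kernel):
    """||w||^2 = sum_{i,j} alpha_i y_i alpha_j y_j kappa(x_i, x_j)."""
    m = len(xs)
    return sum(
        alpha[i] * ys[i] * alpha[j] * ys[j] * kernel(xs[i], xs[j])
        for i in range(m)
        for j in range(m)
    )


# Hypothetical toy sample and dual weights (assumptions for illustration).
xs = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
ys = [1, -1, 1]
alpha = [0.5, 0.25, 1.0]
x_new = (2.0, -1.0)

# Explicit primal weight vector w = sum_i alpha_i y_i x_i (linear kernel only).
w = tuple(sum(a * y * x[d] for a, y, x in zip(alpha, ys, xs)) for d in range(2))

f_dual = dual_predict(alpha, ys, xs, x_new, linear_kernel)
f_primal = linear_kernel(w, x_new)
sq_norm_dual = dual_sq_norm(alpha, ys, xs, linear_kernel)
sq_norm_primal = linear_kernel(w, w)
```

For a non-linear kernel only the dual quantities are available, which is why the predictor and the norm are expressed through the kernel matrix.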

Given a kernel $\kappa$, we will use $\phi_\kappa(\cdot)$ to denote a feature space mapping satisfying

$$\kappa(x, x') = \langle \phi_\kappa(x), \phi_\kappa(x') \rangle.$$

Hence, learning with a kernel $\kappa$ can be described as finding a function from the class of functions [9]:

$$\mathcal{F}_\kappa = \{x \mapsto \langle w, \phi_\kappa(x) \rangle \mid \|w\|_2 \le 1\}$$

minimising the empirical average of the hinge loss

$$h^\gamma(y f(x)) = \max(\gamma - y f(x), 0),$$

where we call $\gamma \in [0, 1]$ the margin. For multiple kernel learning we consider a family of kernels $\mathcal{K}$ and the corresponding function class

$$\mathcal{F}_{\mathcal{K}} = \{x \mapsto \langle w, \phi_\kappa(x) \rangle \mid \|w\|_2 \le 1 \text{ for some } \kappa \in \mathcal{K}\}.$$
For a distribution $\mathcal{D}$, we use the notation $\mathbb{E}_{\mathcal{D}}[f(x)]$ to denote the expected value of $f(x)$ when $x \sim \mathcal{D}$. Given a training set $\mathbf{x}$ we use $\hat{\mathbb{E}}[f]$ to denote the empirical average of $f$ over the sample $\mathbf{x}$.

For the generalisation error bounds we assume that the data are generated iid from a fixed but unknown probability distribution $\mathcal{D}$ over the joint space $\mathcal{X} \times \mathcal{Y}$. Given the true error of a function $f$:

$$\mathrm{err}(f) = \Pr_{(x,y) \sim \mathcal{D}}(y f(x) \le 0) = \mathbb{E}_{\mathcal{D}}[\mathbb{I}(y f(x) \le 0)],$$

the empirical margin error of $f$ with margin $\gamma > 0$:

$$\widehat{\mathrm{err}}^\gamma(f) = \frac{1}{m} \sum_{i=1}^m \mathbb{I}(y_i f(x_i) < \gamma) = \hat{\mathbb{E}}[\mathbb{I}(y f(x) < \gamma)],$$

where $\mathbb{I}$ is the indicator function, and the estimation error $\mathrm{est}^\gamma(f)$

$$\mathrm{est}^\gamma(f) = |\mathrm{err}(f) - \widehat{\mathrm{err}}^\gamma(f)|,$$

we would like to find an upper bound for $\mathrm{est}^\gamma(f)$. In the sequel we will state the bounds in standard form, where the true error $\mathrm{err}(f)$ of a function $f$ is upper bounded by the empirical margin error $\widehat{\mathrm{err}}^\gamma(f)$ plus the estimation error $\mathrm{est}^\gamma(f)$:

$$\mathrm{err}(f) \le \widehat{\mathrm{err}}^\gamma(f) + \mathrm{est}^\gamma(f). \tag{1}$$
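We further consider the function:

$$\mathcal{A}^\gamma(s) = \begin{cases} 0, & \text{if } s > \gamma; \\ 1 - s/\gamma, & \text{if } 0 \le s \le \gamma; \\ 1, & \text{otherwise,} \end{cases}$$

and its empirical estimation $\hat{\mathbb{E}}[\mathcal{A}^\gamma(y f(x))]$. Note that $\mathrm{err}(f) \le \mathbb{E}_{\mathcal{D}}[\mathcal{A}^\gamma(y f(x))]$, $\hat{\mathbb{E}}[\mathcal{A}^\gamma(y f(x))] \le \widehat{\mathrm{err}}^\gamma(f)$ and $\hat{\mathbb{E}}[\mathcal{A}^\gamma(y f(x))] \le \hat{\mathbb{E}}[h^\gamma(y f(x))]$.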
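The first two properties follow from a pointwise sandwich: the ramp function lies between the indicator of $s \le 0$ and the indicator of $s < \gamma$. The sketch below checks this on a handful of hypothetical margin values $s = y f(x)$; the values and helper names are assumptions made only for illustration.

```python
def ramp_loss(s, gamma):
    """A^gamma(s): 0 above the margin, 1 below zero, linear in between."""
    if s > gamma:
        return 0.0
    if 0.0 <= s <= gamma:
        return 1.0 - s / gamma
    return 1.0


def zero_one(s):
    """Indicator of a classification error, I(s <= 0)."""
    return 1.0 if s <= 0.0 else 0.0


def margin_indicator(s, gamma):
    """Indicator of a margin error, I(s < gamma)."""
    return 1.0 if s < gamma else 0.0


gamma = 0.5
# Hypothetical values of y * f(x) on a sample of size 8.
margins = [-1.0, -0.1, 0.0, 0.1, 0.25, 0.5, 0.75, 2.0]
m = len(margins)

emp_zero_one = sum(zero_one(s) for s in margins) / m
emp_ramp = sum(ramp_loss(s, gamma) for s in margins) / m
emp_margin_err = sum(margin_indicator(s, gamma) for s in margins) / m

# Pointwise sandwich I(s <= 0) <= A^gamma(s) <= I(s < gamma).
pointwise_ok = all(
    zero_one(s) <= ramp_loss(s, gamma) <= margin_indicator(s, gamma)
    for s in margins
)
```

Because the sandwich holds for every value of $s$, it is preserved by both the empirical average and the expectation under $\mathcal{D}$.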

Let $\mathcal{K} = \{\kappa_1, \ldots, \kappa_p\}$ denote a family of kernels, where each kernel $\kappa_j$ is called the $j$th base kernel. The following kernel family is formed using a convex combination of base kernels:

$$\mathcal{K}_{con} = \left\{ \kappa_\lambda = \sum_{j=1}^p \lambda_j \kappa_j \;\middle|\; \lambda_j \ge 0, \ \sum_{j=1}^p \lambda_j = 1 \right\}.$$

Note that $p$ is the complexity of the kernel family (i.e., the cardinality of the set of base kernels).
2.2 Additive Rademacher complexity bound for MKL

In this section we derive the additive Rademacher bound from [5] in considerably more detail. We begin by stating the following well-known
concentration inequality, followed by a definition of Rademacher complexity. 

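Theorem 1 ([6]). Let $X_1, \ldots, X_m$ be independent random variables taking values in a set $A$, and assume that $f : A^m \to \mathbb{R}$ satisfies

$$\sup_{x_1, \ldots, x_m, \hat{x}_i \in A} |f(x_1, \ldots, x_m) - f(x_1, \ldots, x_{i-1}, \hat{x}_i, x_{i+1}, \ldots, x_m)| \le c_i, \quad 1 \le i \le m.$$

Then for all $\epsilon > 0$

$$\Pr\{f(X_1, \ldots, X_m) - \mathbb{E} f(X_1, \ldots, X_m) \ge \epsilon\} \le \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^m c_i^2}\right).$$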

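Definition 2 (Rademacher complexity). For a sample $\mathbf{x} = \{x_1, \ldots, x_m\}$ generated by a distribution $\mathcal{D}_{\mathcal{X}}$ on a set $\mathcal{X}$ and a real-valued function class $\mathcal{F}$ with domain $\mathcal{X}$, the empirical Rademacher complexity of $\mathcal{F}$ is the random variable

$$\hat{R}_m(\mathcal{F}) = \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^m \sigma_i f(x_i) \;\middle|\; x_1, \ldots, x_m\right],$$

where $\sigma = (\sigma_1, \ldots, \sigma_m)$ are independent uniform $\{\pm 1\}$-valued (Rademacher) random variables. The (true) Rademacher complexity is:

$$R_m(\mathcal{F}) = \mathbb{E}_{\mathbf{x}}[\hat{R}_m(\mathcal{F})] = \mathbb{E}_{\mathbf{x}\sigma}\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^m \sigma_i f(x_i)\right].$$

The standard margin-based Rademacher bound for learning theory is given in the following theorem.

Theorem 2 ([2]). Fix $\gamma > 0$ and $\delta \in (0, 1)$, and let $\mathcal{F}$ be a class of functions mapping from $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ to $[0, 1]$. Let $z = \{z_i\}_{i=1}^m$ be drawn independently according to a probability distribution $\mathcal{D}$. Then with probability $1 - \delta$ over random draws of samples of size $m$, every $f \in \mathcal{F}$ satisfies

$$\mathbb{E}_{\mathcal{D}}[f(z)] \le \hat{\mathbb{E}}[f(z)] + \hat{R}_m(\mathcal{F}) + 3\sqrt{\frac{\ln(2/\delta)}{2m}}.$$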
We have attributed this bound to [3] though strictly speaking they used the slightly weaker version of Rademacher complexity including an absolute value of the sum. This version is obtained by a slight tightening of the argument. This bound is quite general and applicable to various learning algorithms if an empirical Rademacher complexity $\hat{R}_m(\mathcal{F})$ of the function class $\mathcal{F}$ can be found efficiently. For kernel method algorithms a well-known result uses the trace of the kernel matrix to bound the empirical Rademacher complexity.

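Theorem 3 ([3]). If $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel, and $\mathbf{x} = \{x_1, \ldots, x_m\}$ is a sample of points from $\mathcal{X}$, then the empirical Rademacher complexity of the class $\mathcal{F}_\kappa$ satisfies

$$\hat{R}_m(\mathcal{F}_\kappa) \le \frac{2}{m} \sqrt{\sum_{i=1}^m \kappa(x_i, x_i)}.$$

Furthermore, if $R^2 \ge \kappa(x, x)$ for all $x \in \mathcal{X}$ and $\kappa$ is a normalised kernel such that $\sum_{i=1}^m \kappa(x_i, x_i) = m$ then we have:

$$\hat{R}_m(\mathcal{F}_\kappa) \le \frac{2R}{\sqrt{m}}.$$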
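For a very small sample the trace bound can be compared with the exact empirical Rademacher complexity of the unit-ball kernel class. The sketch below relies on the closed form $\sup_{\|w\| \le 1} \frac{2}{m} \sum_i \sigma_i \langle w, \phi_\kappa(x_i) \rangle = \frac{2}{m} \sqrt{\sigma^\top K \sigma}$ (Cauchy-Schwarz) and enumerates all $2^m$ sign vectors. The Gaussian kernel and the data points are assumptions chosen for illustration; they do not appear in [5].

```python
import itertools
import math


def rbf_kernel(x, z, width=1.0):
    """Gaussian kernel; normalised in the sense that kappa(x, x) = 1."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def kernel_matrix(xs, kernel):
    return [[kernel(xi, xj) for xj in xs] for xi in xs]


def exact_empirical_rademacher(K):
    """E_sigma[(2/m) sqrt(sigma^T K sigma)] by enumerating all sign vectors."""
    m = len(K)
    total = 0.0
    for sigma in itertools.product((-1.0, 1.0), repeat=m):
        quad = sum(sigma[i] * sigma[j] * K[i][j]
                   for i in range(m) for j in range(m))
        total += (2.0 / m) * math.sqrt(max(quad, 0.0))  # guard against rounding
    return total / 2 ** m


def trace_bound(K):
    """(2/m) sqrt(trace(K)), the first inequality of Theorem 3."""
    m = len(K)
    return (2.0 / m) * math.sqrt(sum(K[i][i] for i in range(m)))


# Hypothetical sample of five points in the plane.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (-1.0, 0.5)]
K = kernel_matrix(xs, rbf_kernel)
exact = exact_empirical_rademacher(K)
bound = trace_bound(K)
```

The enumeration costs $2^m$ evaluations and is only feasible for tiny samples, which is one reason the trace bound is used in practice.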



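The problem of learning kernels from a convex combination of base kernels is related to using the convex hull of a set of functions. Consider

$$\mathrm{con}(\mathcal{F}) = \left\{ \sum_j a_j f_j \;:\; f_j \in \mathcal{F}, \ a_j \ge 0, \ \sum_j a_j \le 1 \right\}. \tag{2}$$

Since adding kernels corresponds to concatenating feature spaces, it is clear that (here $w_j$ is the restriction of $w$ to the feature space defined by the $j$th base kernel)

$$\mathcal{F}_{\mathcal{K}_{con}} \subseteq \mathrm{con}\left(\bigcup_{j=1}^p \mathcal{F}_{\kappa_j}\right),$$

since by Cauchy-Schwarz

$$\sum_{j=1}^p \sqrt{\lambda_j}\, \|w_j\| \le \sqrt{\sum_{j=1}^p \lambda_j} \sqrt{\sum_{j=1}^p \|w_j\|^2} = \|w\|_2 \le 1.$$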



Hence, we are interested in the empirical Rademacher complexity of a convex hull as given by Equation (2), which is well known to be bounded by

$$\hat{R}_m(\mathrm{con}(\mathcal{F})) \le \hat{R}_m(\mathcal{F}).$$

Furthermore, we have the following result.

Theorem 4 ([2]). The empirical Rademacher complexity of the function class $\mathcal{L}(\mathcal{F})$, where $\mathcal{L}(\cdot)$ is a Lipschitz function with Lipschitz constant $L$, is bounded by

$$\hat{R}_m(\mathcal{L}(\mathcal{F})) \le L \hat{R}_m(\mathcal{F}).$$

Given all of the results from above, we are now in a position to state the following theorem, which proves a high probability upper bound for the empirical Rademacher complexity of a joint function class $\mathcal{F} = \bigcup_{j=1}^p \mathcal{F}_j$.

Theorem 5. Let $\mathbf{x} = \{x_1, \ldots, x_m\}$ be an $m$-sample of points from $\mathcal{X}$, then with probability at least $1 - \delta$ the empirical Rademacher complexity $\hat{R}_m$ of the class $\mathcal{F} = \bigcup_{j=1}^p \mathcal{F}_j$, where the range of all the functions in $\mathcal{F}$ is $[0, 1]$, satisfies:

$$\hat{R}_m(\mathcal{F}) \le \max_{1 \le j \le p} \hat{R}_m(\mathcal{F}_j) + 8\sqrt{\frac{\ln((p+1)/\delta)}{2m}}.$$

Proof. From McDiarmid's inequality we have the following probabilistic upper bound (for some function $g(\mathbf{x}) = g(x_1, \ldots, x_m)$ satisfying $\sup_{x_1, \ldots, x_m, \hat{x}_i} |g(\mathbf{x}) - g(x_1, \ldots, \hat{x}_i, \ldots, x_m)| \le c$):

$$\Pr\{g(X_1, \ldots, X_m) - \mathbb{E}[g(X_1, \ldots, X_m)] \ge \epsilon\} \le \exp\left(\frac{-2\epsilon^2}{m c^2}\right) =: \delta, \tag{4}$$

and conversely:

$$\Pr\{g(X_1, \ldots, X_m) - \mathbb{E}[g(X_1, \ldots, X_m)] < \epsilon\} \ge 1 - \delta. \tag{5}$$



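Let us define

$$g_j(\sigma) = \sup_{f \in \mathcal{F}_j} \frac{2}{m} \sum_{i=1}^m \sigma_i f(x_i), \quad 1 \le j \le p,$$

where $\sigma = (\sigma_1, \ldots, \sigma_m)$, and

$$g_0(\sigma) = -\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^m \sigma_i f(x_i).$$

Note that for all of these functions $c = 4/m$ (flipping a single $\sigma_i$ changes $\frac{2}{m} \sum_i \sigma_i f(x_i)$ by at most $\frac{2}{m} \cdot 2 |f(x_i)| \le 4/m$, as $f$ takes values in $[0, 1]$) so that

$$\delta = \exp\left(\frac{-m\epsilon^2}{8}\right),$$

so that solving for $\epsilon$ gives

$$\epsilon = \sqrt{\frac{8 \ln(1/\delta)}{m}}.$$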



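We would like to upper bound the following empirical Rademacher complexity,

$$\hat{R}_m(\mathcal{F}) = \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^m \sigma_i f(x_i) \;\middle|\; x_1, \ldots, x_m\right]. \tag{6}$$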



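We will ignore the conditioning from now on. From Equation (5) with probability $1 - \delta$ we can bound the expectation $\mathbb{E}_\sigma[g_0(\sigma)]$ by:

$$\begin{aligned} -\mathbb{E}_\sigma[g_0(\sigma)] &\le -g_0(\sigma^*) + \epsilon, \\ \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^m \sigma_i f(x_i)\right] &\le \sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^m \sigma_i^* f(x_i) + \epsilon, \end{aligned}$$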



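where $\sigma^* = (\sigma_1^*, \ldots, \sigma_m^*)$ is a realisation of a Rademacher sequence. This 'trick' allows us to remove the expectation. We know that the supremum of a joint function class can be upper bounded by the max over the supremum of each of the function classes. Hence, with probability at least $1 - \delta$ this gives us:

$$\begin{aligned} \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{2}{m} \sum_{i=1}^m \sigma_i f(x_i)\right] &\le \max_{1 \le j \le p} \sup_{f \in \mathcal{F}_j} \frac{2}{m} \sum_{i=1}^m \sigma_i^* f(x_i) + \epsilon \\ &= \max_{1 \le j \le p} g_j(\sigma^*) + \epsilon. \end{aligned} \tag{7}$$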



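Next we can make another application of Equation (5) to have a bound in terms of $g_j$. Using the union bound we have with probability $1 - p\delta$ for $1 \le j \le p$:

$$\begin{aligned} g_j(\sigma^*) &\le \mathbb{E}_\sigma[g_j(\sigma)] + \epsilon, \\ \sup_{f \in \mathcal{F}_j} \frac{2}{m} \sum_{i=1}^m \sigma_i^* f(x_i) &\le \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}_j} \frac{2}{m} \sum_{i=1}^m \sigma_i f(x_i)\right] + \epsilon. \end{aligned} \tag{8}$$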



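Therefore substituting Equation (8) into Equation (7) gives us with probability $1 - (p+1)\delta$ an upper bound on the empirical Rademacher complexity of a joint function class:

$$\begin{aligned} \hat{R}_m(\mathcal{F}) &\le \max_{1 \le j \le p} \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}_j} \frac{2}{m} \sum_{i=1}^m \sigma_i f(x_i)\right] + 2\epsilon \\ &= \max_{1 \le j \le p} \hat{R}_m(\mathcal{F}_j) + 2\epsilon. \end{aligned} \tag{9}$$

Finally substituting $\delta/(p+1)$ for $\delta$ gives

$$\epsilon = \sqrt{\frac{8 \ln((p+1)/\delta)}{m}} = 4\sqrt{\frac{\ln((p+1)/\delta)}{2m}}.$$

Hence, with probability $1 - \delta$ we have the final result:

$$\hat{R}_m(\mathcal{F}) \le \max_{1 \le j \le p} \hat{R}_m(\mathcal{F}_j) + 8\sqrt{\frac{\ln((p+1)/\delta)}{2m}}. \tag{10}$$
□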
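The structure of Theorem 5 can be illustrated on small finite function classes, for which the empirical Rademacher complexity is computable exactly by enumerating sign vectors. The sketch below is only an illustration under stated assumptions: each function is represented by its vector of values on the sample, the classes are hypothetical, and the additive term is taken as $8\sqrt{\ln((p+1)/\delta)/(2m)}$, i.e. $2\epsilon$ with $\delta/(p+1)$ in place of $\delta$.

```python
import itertools
import math


def empirical_rademacher(function_class, m):
    """Exact E_sigma[sup_f (2/m) sum_i sigma_i f(x_i)] for a finite class.

    Each f is given as the tuple (f(x_1), ..., f(x_m)) of its sample values.
    """
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=m):
        total += max(
            (2.0 / m) * sum(s * v for s, v in zip(sigma, f))
            for f in function_class
        )
    return total / 2 ** m


def epsilon(m, delta):
    """epsilon = sqrt(8 ln(1/delta) / m), from delta = exp(-m eps^2 / 8)."""
    return math.sqrt(8.0 * math.log(1.0 / delta) / m)


def union_slack(p, m, delta):
    """Additive term 8 sqrt(ln((p+1)/delta) / (2m)) of Theorem 5."""
    return 8.0 * math.sqrt(math.log((p + 1) / delta) / (2.0 * m))


# Hypothetical finite classes with values in [0, 1] on a sample of size 4.
m = 4
classes = [
    [(0.0, 0.5, 1.0, 0.2), (1.0, 1.0, 0.0, 0.0)],
    [(0.3, 0.3, 0.3, 0.3), (0.0, 1.0, 0.0, 1.0)],
    [(0.9, 0.1, 0.8, 0.2)],
]
p = len(classes)
delta = 0.05

union = [f for cls in classes for f in cls]
lhs = empirical_rademacher(union, m)
per_class_max = max(empirical_rademacher(cls, m) for cls in classes)
rhs = per_class_max + union_slack(p, m, delta)
```

For a sample this small the additive term dominates, so the example shows the shape of the bound rather than its tightness; the term decays as $1/\sqrt{m}$ and grows only logarithmically in $p$.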



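Recall the function $\mathcal{A}^\gamma(\cdot)$ and the properties $\mathrm{err}(f) \le \mathbb{E}_{\mathcal{D}}[\mathcal{A}^\gamma(y f(x))]$ and $\hat{\mathbb{E}}[\mathcal{A}^\gamma(y f(x))] \le \widehat{\mathrm{err}}^\gamma(f)$. Therefore we have the following generalisation error bound for MKL in the case of a convex combination of kernels.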

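Theorem 6. Fix $\gamma > 0$ and $\delta \in (0, 1)$. Let $\mathcal{K} = \{\kappa_1, \ldots, \kappa_p\}$ be a family of kernels containing $p$ base kernels and let $z = \{z_i\}_{i=1}^m$ be a randomly generated sample from distribution $\mathcal{D}$. Then with probability $1 - \delta$ over random draws of samples of size $m$, every $f \in \mathcal{F}_{\mathcal{K}_{con}}$ satisfies

$$\mathrm{err}(f) \le \hat{\mathbb{E}}[\mathcal{A}^\gamma(y f(x))] + \frac{2}{\gamma m} \max_{1 \le j \le p} \sqrt{\sum_{i=1}^m \kappa_j(x_i, x_i)} + 11\sqrt{\frac{\ln((p+3)/\delta)}{2m}}.$$

Also, if each kernel $\kappa_j$ is normalised and bounded by $R^2 \ge \kappa_j(x, x)$ for all $x \in \mathcal{X}$ and $j \in \{1, \ldots, p\}$, we have:

$$\mathrm{err}(f) \le \hat{\mathbb{E}}[\mathcal{A}^\gamma(y f(x))] + \frac{2R}{\gamma \sqrt{m}} + 11\sqrt{\frac{\ln((p+3)/\delta)}{2m}}.$$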
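Proof. Each kernel $\kappa_j$ defines a function class

$$\mathcal{F}_j = \{x \mapsto \langle w, \phi_{\kappa_j}(x) \rangle : \|w\| \le 1\}.$$

Hence, applying Theorem 2 to the class of functions $\mathcal{A}^\gamma \circ \mathcal{F}_{\mathcal{K}_{con}} = \{(x, y) \mapsto \mathcal{A}^\gamma(y f(x)) : f \in \mathcal{F}_{\mathcal{K}_{con}}\}$ we have:

$$\begin{aligned} \mathrm{err}(f) &\le \mathbb{E}_{\mathcal{D}}[\mathcal{A}^\gamma(y f(x))] \\ &\le \hat{\mathbb{E}}[\mathcal{A}^\gamma(y f(x))] + \hat{R}_m(\mathcal{A}^\gamma \circ \mathcal{F}_{\mathcal{K}_{con}}) + 3\sqrt{\frac{\ln((p+3)/\delta)}{2m}} \\ &\le \hat{\mathbb{E}}[\mathcal{A}^\gamma(y f(x))] + \max_{1 \le j \le p} \hat{R}_m(\mathcal{A}^\gamma \circ \mathcal{F}_j) + 11\sqrt{\frac{\ln((p+3)/\delta)}{2m}} \\ &\le \hat{\mathbb{E}}[\mathcal{A}^\gamma(y f(x))] + \frac{2}{\gamma m} \max_{1 \le j \le p} \sqrt{\sum_{i=1}^m \kappa_j(x_i, x_i)} + 11\sqrt{\frac{\ln((p+3)/\delta)}{2m}} \\ &\le \hat{\mathbb{E}}[\mathcal{A}^\gamma(y f(x))] + \frac{2R}{\gamma \sqrt{m}} + 11\sqrt{\frac{\ln((p+3)/\delta)}{2m}}, \end{aligned}$$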

where the second line is given by Theorem 2 with $2\delta/(p+3)$ in place of $\delta$, the third line comes from applying Theorem 5 with $(p+1)\delta/(p+3)$ in place of $\delta$, while the fourth uses a combination of Theorem 4 (note that $\mathcal{A}^\gamma(\cdot)$ is Lipschitz with Lipschitz constant $L = 1/\gamma$) and the first inequality in Theorem 3. The final line is obtained by applying the second inequality in Theorem 3 for the case when each kernel $\kappa_j$ is bounded by $R^2$. □
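The bookkeeping in this proof can be verified numerically. The sketch below assumes the confidence terms $3\sqrt{\ln(2/\delta)/(2m)}$ for Theorem 2 and $8\sqrt{\ln((p+1)/\delta)/(2m)}$ for Theorem 5 (these constants are inferred from the substitutions described above and the final constant 11), and uses hypothetical values for $m$, $p$, $\gamma$, $\delta$ and the kernel traces.

```python
import math


def mkl_complexity_terms(kernel_traces, m, gamma, delta, constant=11.0):
    """Return the two additive terms of the unnormalised bound in Theorem 6.

    kernel_traces[j] is sum_i kappa_j(x_i, x_i) for the j-th base kernel.
    """
    p = len(kernel_traces)
    kernel_term = (2.0 / (gamma * m)) * max(math.sqrt(t) for t in kernel_traces)
    confidence_term = constant * math.sqrt(math.log((p + 3) / delta) / (2.0 * m))
    return kernel_term, confidence_term


def confidence_split(p, delta):
    """Confidence levels passed to Theorem 2 and Theorem 5 in the proof."""
    return 2.0 * delta / (p + 3), (p + 1) * delta / (p + 3)


# Hypothetical setting: p = 10 normalised base kernels, so each trace equals m.
m, p, gamma, delta = 1000, 10, 0.5, 0.05
traces = [float(m)] * p

kernel_term, confidence_term = mkl_complexity_terms(traces, m, gamma, delta)
delta_thm2, delta_thm5 = confidence_split(p, delta)

# The two pieces that add up to the constant 11.
piece_thm2 = 3.0 * math.sqrt(math.log(2.0 / delta_thm2) / (2.0 * m))
piece_thm5 = 8.0 * math.sqrt(math.log((p + 1) / delta_thm5) / (2.0 * m))
```

Both substitutions turn the logarithm into $\ln((p+3)/\delta)$, and the two failure probabilities sum to $\delta$, which is how the single constant $3 + 8 = 11$ arises.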

3 Discussion 

Theorem 8 presented in our AISTATS paper [5] is (using all the notation from above and only presenting the unnormalised version):

$$\mathrm{err}(f) \le \widehat{\mathrm{err}}^\gamma(f) + \frac{2}{\gamma m} \max_{1 \le j \le p} \sqrt{\sum_{i=1}^m \kappa_j(x_i, x_i)} + 5\sqrt{\frac{\ln((p+3)/\delta)}{2m}}.$$

Comparing this to Theorem 6 we can see the only differences are:

• A constant of 11 instead of 5. 

• Using the tighter empirical loss $\hat{\mathbb{E}}[\mathcal{A}^\gamma(y f(x))]$ as opposed to $\widehat{\mathrm{err}}^\gamma(f)$.
It is important to note that the bound is additive in the sense that it combines the logarithm of the number of kernels and the margin additively. Furthermore for this scenario it is tighter than any previously published bounds in the $\ell_1$ norm regularised MKL literature.